Assignment Instructions

Complete all questions below. After completing the assignment, knit your document, and download both your .Rmd and knitted output. Upload your files for peer review.

For each response, include comments detailing your response and what each line does. Ensure you test your functions with sufficient test cases to identify and correct any potential bugs.

Required Libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Question 1.

Identify the primary keys in the following datasets. Be sure to show that you have the primary key by showing there are no duplicate entries.

Lahman::Batting - Primary Keys are: playerID, yearID, stint, teamID babynames::babynames - Primary keys are: year, sex, name nasaweather::atmos - Primary keyws are: lat, long, year, month

# Load necessary libraries
library(Lahman)
library(babynames)
library(nasaweather)
## 
## Attaching package: 'nasaweather'
## The following object is masked from 'package:dplyr':
## 
##     storms
Batting <- Lahman::Batting
Batting %>%
  count(playerID, yearID, stint, teamID) %>%
  filter(n > 1)
babynames <- babynames::babynames
babynames %>%
  count(year, sex, name) %>%
  filter(n > 1)
nasaweather::atmos
atmos <- nasaweather::atmos
atmos %>%
  count(lat, long, year, month) %>%
  filter(n > 1)
Question 2.

What is the relationship between the “Batting”, “People”, and “Salaries” tables in the “Lahman” package? What are the keys for each dataset and how do they relate to each other?

Note “Master” Package has been renamed to “People”

The common key among all three tables is the playerID column. This column uniquely identifies each player across different tables. The “Batting” table contains statistics related to a player’s performance in various seasons. The “People” table provides additional personal information about players, coaches, and umpires, including their birthdates, birthplaces, and physical attributes. The “Salaries” table records the salary information for players across different teams and seasons.

names(Lahman::Batting)
##  [1] "playerID" "yearID"   "stint"    "teamID"   "lgID"     "G"       
##  [7] "AB"       "R"        "H"        "X2B"      "X3B"      "HR"      
## [13] "RBI"      "SB"       "CS"       "BB"       "SO"       "IBB"     
## [19] "HBP"      "SH"       "SF"       "GIDP"
names(Lahman::People)
##  [1] "playerID"     "birthYear"    "birthMonth"   "birthDay"     "birthCountry"
##  [6] "birthState"   "birthCity"    "deathYear"    "deathMonth"   "deathDay"    
## [11] "deathCountry" "deathState"   "deathCity"    "nameFirst"    "nameLast"    
## [16] "nameGiven"    "weight"       "height"       "bats"         "throws"      
## [21] "debut"        "finalGame"    "retroID"      "bbrefID"      "deathDate"   
## [26] "birthDate"
names(Lahman::Salaries)
## [1] "yearID"   "teamID"   "lgID"     "playerID" "salary"
Question 3.

Load the “nycflights13” library. Use an appropriate join to add a column containing the airline name to the “flights” dataset present in the library. Be sure to put the carrier code and name in the first two columns of the result so we can see them. Save the result as “flights2”.

# Load required libraries
library(nycflights13)

# Perform a left join to add airline names to flights
flights2 <- flights %>%
  left_join(airlines, by = c("carrier" = "carrier")) %>%
  select(carrier, name, everything())

# View the first few rows to confirm the result
head(flights2)
Question 4.

Use an appropriate join to add the airport name to the “flights2” dataset you got above. The codes and names of the airports are in the “airports” dataset of the “nycflights13” package. Put the carrier and carrier name first followed by the destination and destination name, then everything else.

flights2 <- flights2 %>%
  left_join(airports, by = c("dest" = "faa")) %>% 
  select(carrier, name.x, dest, name.y, everything())

flights2
Question 5.

The “nycflights13” library and the code to create spatial map is provided for you. Now compute the average delay by destination, then join on the airports dataframe so you can show the spatial distribution of delays.

Use the textbook for reference.

library(nycflights13)
library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
airports
flights
# Add the location of the origin and destination (i.e. the lat and lon) to flights.
    # Join to add origin latitude and longitude
    flights_with_origin <- flights %>%
      left_join(airports %>% select(faa, origin_lat = lat, origin_lon = lon), by = c("origin" = "faa"))
    
    # Join to add destination latitude and longitude
    flights_complete <- flights_with_origin %>%
      left_join(airports %>% select(faa, dest_lat = lat, dest_lon = lon), by = c("dest" = "faa"))
    
    # Select and rename columns to keep the dataframe clean and informative
    flights_complete <- flights_complete %>%
      select(year:day, hour, origin, origin_lat, origin_lon, dest, dest_lat, dest_lon, everything())
    
    # View the first few rows to confirm the new columns
    head(flights_complete)
# Calculate average delay by destination
average_delays <- flights %>%
  group_by(dest) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE))

average_delays
# Join average delays with airports data
airports_with_delays <- airports %>%
  inner_join(average_delays, by = c("faa" = "dest"))

airports_with_delays
#Use the size or colour of the points to display the average delay for each airport
# Create the map using ggplot
ggplot(airports_with_delays, aes(x = lon, y = lat)) +
  borders("state") +  # Add state borders
  geom_point(aes( color = avg_delay)) +  # Points sized & colored by delay
  scale_color_gradient(low = "green", high = "red") +  # Color gradient from green (low delays) to red (high delays)
  #scale_size(range = c(1, 10)) +  # Scale the size of points, handle NA values by setting them to smallest size
  labs(color = "Average Delay") +
  coord_quickmap() +
  theme_minimal() +
  ggtitle("Average Arrival Delay by Airport")

Question 6.

Use a set operation function to find which airport codes from flights are not in the airports dataset.

# Extract unique airport codes from flights
flight_airports <- unique(c(unique(flights$origin), unique(flights$dest)))

# Extract unique FAA codes from airports
airport_codes <- unique(airports$faa)

# Use setdiff to find codes in flights not in airports
missing_airports <- setdiff(flight_airports, airport_codes)

# Display missing airport codes
missing_airports
## [1] "BQN" "SJU" "STT" "PSE"